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Abstract 

Background: Genome-wide association studies (GWAS) have identified many common polymorphisms associated 
with complex traits. However, these associated common variants explain only a small fraction of the phenotypic 
variances, leaving a substantial portion of genetic heritability unexplained. As a result, searches for "missing" 
heritability are drawing increasing attention, particularly for rare variant studies that often require a large sample 
size and, thus, extensive sequencing effort. Although the development of next generation sequencing (NGS) 
technologies has made it possible to sequence a large number of reads economically and efficiently, it is still often 
cost prohibitive to sequence thousands of individuals that are generally required for association studies. A more 
efficient and cost-effective design would involve pooling the genetic materials of multiple individuals together and 
then sequencing the pools, instead of the individuals. This pooled sequencing approach has improved the 
plausibility of association studies for rare variants, while, at the same time, posed a great challenge to the pooled 
sequencing data analysis, essentially because individual sample identity is lost, and NGS sequencing errors could be 
hard to distinguish from low frequency alleles. 

Results: A unified approach for estimating minor allele frequency, SNP calling and association studies based on 
pooled sequencing data using an expectation maximization (EM) algorithm is developed in this paper. This 
approach makes it possible to study the effects of minor allele frequency, sequencing error rate, number of pools, 
number of individuals in each pool, and the sequencing depth on the estimation accuracy of minor allele 
frequencies. We show that the naive method of estimating minor allele frequencies by taking the fraction of 
observed minor alleles can be significantly biased, especially for rare variants. In contrast, our EM approach can give 
an unbiased estimate of the minor allele frequency under all scenarios studied in this paper. A SNP calling 
approach, EM-SNP, for pooled sequencing data based on the EM algorithm is then developed and compared with 
another recent SNP calling method, SNVer. We show that EM-SNP outperforms SNVer in terms of the fraction of 
db-SNPs among the called SNPs, as well as transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to 
study the association between variants and type I diabetes. 

Conclusions: The EM-based approach for the analysis of pooled sequencing data can accurately estimate minor 
allele frequencies, call SNPs, and find associations between variants and complex traits. This approach is especially 
useful for studies involving rare variants. 
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Introduction 

Finding genomic variants associated with complex traits 
is one of the most important problems in modern geno- 
mics. Genome-wide association studies (GWAS) based 
on common variants have been the dominant approach 
to achieve this objective [1]. However, the genomic var- 
iants identified in GWAS studies often explain only a 
small portion of the phenotypic variation related to heri- 
table human diseases, a phenomenon known as "missing 
heritability" in genomics literature [2] . This missing herit- 
ability problem has led to increasingly skeptical views of 
the common disease-common variant (CD-CV) hypoth- 
esis which predicts that common disease-causing alleles, 
or variants, will be found in all human populations that 
manifest a given disease. On the other hand, interest in 
studies on rare variants with minor allele frequencies less 
than 1% is growing [3,4]. 

Studies of rare variants are complicated by the low 
minor allele frequencies of rare variants. The develop- 
ment of next generation sequencing (NGS) technologies 
such as Illumina and Roche 454 has made it possible to 
sequence a large number of reads economically. Despite 
such important progress, sequencing a large number of 
individuals separately is still costly for most biological 
laboratories. One frequently adopted approach to reduce 
sequencing cost in the search of rare variants is pooled 
sequencing, where mixtures of genetic materials from 
several individuals are grouped together to form a pool 
for a single sequencing. While this design greatly lowers 
the sequencing cost, it also makes it hard to distinguish 
true genetic polymorphisms from sequencing errors, esti- 
mate minor allele frequencies at the polymorphic loci, 
and perform association studies on the rare variants. 

Several research groups have used pooled sequencing to 
detect rare variants that are associated with complex traits 
such as retinitis pigmentosa, diabetes, cancer, and inflam- 
matory bowel disease [5-8]. There are generally two types 
of pooling designs. One is pooling of tagged samples with 
each individual tagged by a unique short barcode. In this 
design, the genomic origins of the reads can be identified. 
However, barcoding many individuals and distinguishing 
these barcodes from each other can still be a challenging 
task. The second type of pooling is to mix the genetic 
materials from different individuals without tagging, and 
then generate reads from the mixture of genetic materials 
using NGS. With this design, the identities of the indivi- 
duals from whom the reads originate cannot be identified. 
In this paper, we concentrate on the second type of pool- 
ing design. 

Several groups have developed SNP calling methods 
based on pooled sequencing data [7,9-12]. Out et al.[7] 
modeled the number of sequencing errors as a Poisson 
random variable and identified SNPs by comparing the 
number of minor alleles within the reads with the 



Poisson distribution. For rare variants with minor allele 
frequencies similar to or lower than the sequencing 
error rate, this approach could miss many true variants 
if the pool size is relatively large. Druley et al. [9] devel- 
oped a SNP identification method, SNPSeeker, that can 
be applied to large pools by using control sequences 
without SNPs. In many studies, control sequences may 
not be available, making this approach impractical. Also, 
the program can only be used to analyze Illumina data. 
Bansal et «/.[10] developed a method called CRISP to 
identify rare variants by comparing the minor allele fre- 
quencies across multiple pools using contingency tables. 
It was shown that CRISP outperforms SNPSeeker in 
terms of accuracy, but CRISP is more computationally 
demanding. Altmann et al. [13] improved the computa- 
tional speed of CRISP and identified SNPs as the variants 
with different minor allele frequencies across at least two 
pools. Wei et al. [11] developed a statistical tool, called 
SNVer, for variant identification. For each pool, SNVer 
first defined a p-value by testing the hypothesis that the 
minor allele frequency is above a given threshold and 
then combined the p-values for individual pools to give 
an overall p-value using Simes methods as in [12]. This 
algorithm makes it possible to rank the observed variants 
so that the top ranked ones are more likely to be true 
SNPs. However, the algorithm needs a pre-specified 
sequencing error rate which can be difficult to do 
because the sequencing error rates can be position 
dependent. In the above studies, the investigators are 
mainly concerned with the detection of SNPs; they do 
not aim to estimate minor allele frequencies. 

In order to estimate minor allele frequencies in pooling 
studies, several groups developed statistical models for 
the sampling of individuals and the sampling of reads 
from the individuals in the pools [14,15]. These studies 
assumed a pre-defined constant sequencing error rate 
across different loci. However, sequencing error rates can 
vary for different loci depending on the nucleotide con- 
tents of the surrounding genomic regions. The effects of 
mis-specifying the sequencing error rate on minor allele 
frequency estimation, SNP detection and power of asso- 
ciation studies are not clear. Using similar models, Lee 
et al. [16] studied an optimal design in pooling studies. 
This involved the number of individuals in each pool and 
the number of pools. Recently, Chen et a/. [17] considered 
more complex issues such as uneven sampling of indivi- 
duals, different coverage of the minor and major alleles 
due to either PCR amplification or reads mapping, and 
reads quality scores. 

In this paper, we develop new methods for estimating 
minor allele frequencies, SNP detection, and association 
studies using pooled sequencing data based on the mod- 
els in [14-16]. In contrast to methods developed in pre- 
vious studies, we do not assume that the sequencing 
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error rate is constant. Instead, we estimate the sequen- 
cing error rate for each position together with the minor 
allele frequency based on the minor and major allele 
counts for all the pools using an expectation-maximiza- 
tion (EM) algorithm. We show that the naive estimation 
of the allele frequency by the fraction of minor alleles in 
the reads can be significantly inflated, especially for rare 
variants, while the EM approach can give an unbiased 
estimate of the minor allele frequency in all situations we 
studied. The estimation accuracy of the EM algorithm 
increases with the number of reads and the number of 
pools, but decreases with the number of chromosomes in 
each pool. Based on the allele frequency estimation, we 
develop a SNP calling method, EM-SNP, and an associa- 
tion test using likelihood ratio statistics. The likelihood 
ratio statistic used in EM-SNP is then used to rank candi- 
date polymorphic loci to determine if they are true poly- 
morphisms. Using a real re-sequencing dataset, we show 
that, for rare variants with minor allele frequencies lower 
than 1%, the fraction of dbSNPs (http://www.ncbi.nlm. 
nih.gov/projects/SNP/) among the SNPs called by EM- 
SNP is higher than that of SNVer. Similarly, the transi- 
tion/transversion ratio of rare variants called by EM-SNP 
is higher than that of SNVer. These observations show 
that EM-SNP outperforms SNVer at calling rare variants 
with minor allele frequencies less than 1%. 

Materials and methods 

Notation 

Consider a locus along the genome. Let /be the fre- 
quency of the minor allele (denoted as "1"), and 1 -/be 
the frequency of the major allele (denoted as "0") in a 
population. We also consider the following potential 
sequencing error rates: 

P(l|0)=o, P{0\l) = p. 

Assume that a total of G pools of individuals are 
sequenced and each pool contains K/2 individuals (K 
chromosomes). For each pool g, a total of n g reads are 
mapped to this locus, with «o x reads carrying the major 
allele and n i s reads carrying the minor allele. Thus 
n g = n 0g + n lg . 

Let C be the number of chromosomes carrying the 
minor allele among the K chromosomes in a pool. Then 
C follows a binomial distribution, i.e 

P{C = k] = Bm{k;K,f) = (*)/ fe (l -f) K ~ k , 

k = 0, 1,2,- ■ -,K. 

Conditional on C = k, the probability that a sequence 
read covering a variant carries the minor allele is 



Thus, the probability of observing the data for the g-th 
pool is, 

K 

Pg{n 0g ,n lg \f iai p) = J2^ n ini s ;n g ,p k )Bm{k;K,f). (1) 

Since the pools can be considered independent, the 
likelihood of observing the data for all the pools is 

G 

L{f,<*.P) = Y[P g {n 0g ,n lg \f,a,p). 
«=i 

Given the above likelihood expression and the data 
{{n Qg , ni g ),g = 1,2, • ■ -, G}, our objectives are as follows 

• Find the maximum likelihood estimate of (/ a, f5). 

• Determine whether an observed variant is a true 
SNP or not, i.e. SNP calling. 

• Find the variants associated with a phenotype of 
interest. 



Computational methods 

An expectation-maximization (EM) approach for allele 
frequency estimation 

Based on the likelihood function, an approximate solu- 
tion to the maximum likelihood estimation of the para- 
meters can be obtained using the EM algorithm. We 
consider the following missing data: 

• C g : the number of chromosomes carrying the 
minor allele in the g-t\\ pool; 

• I gi : the true underlying allele state (major (0) or 
minor (1)) of read i in the g-th pool; 

• r gi : the observed allele state (major (0) and minor 
(1)) of the «-th read in the g-th pool. 

We also use the following notation: 



c = 


E c s> 




Tg = 






Tn 


- 2^ g =i Z-,m 




Tio 




- hi)' 


Toi 


. y a y (i - 




Too 


= y G y" s (i - 


W - hi) 
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Based on the above notation, the complete log-likeli- 
hood is: 

\ogP{rf l C gl nx g ,n 0g ,g = I,- ■ ;G\f,a, p) 
= Clog/+(GK-C)log(l-/) 

G 



E 



log 



«AWf1 -S*' 



+ (ng-TH log (1 



r (x) I I T (x 

J 01 / \M0 

T U log(l-/8) 



+ T«log f(2) 



+ Tio log p + Toi log a + Too log(l - a). 

Suppose that the value of 0 = (f, a, [3) at the t-th 
iteration is 0 W = (f (t \ a (t \ /3 W ). The maximization (M)- 
step gives: 



GxK' 
(t+i) _ £ (t)( T oi) 



E(t)( T oi) + %)(T 0 o)' 
= £(t)(Tio) 
£(t)(Tio)+E w (Tii)' 



Note the expectation E( t ) is taken when the parameters 
are at 0 W . 

The expectation (E)-step is formulated as follows: 



£ {t) (q|Data) 

K 
k 



ELM!)/ fe (i-/r fe ( n n M(^r(i-^r (3) 



and 

G 

E (t) (C| Data) = ^£ (t) (C g | Data), 
«=i 



(4) 



where all the parameters in the equations are of the 
values taken at the t-th step. 

From Equations 3 and 4, we are able to obtain the 
recursive formula for / 

Next we calculate £ w (r u |Data). Note that 

I Data ) 

Em> H 1 ~ g)g^K l,pfc)Bin(fe;K,/) 
P(K,"iJ) 

which does not depend on i, and we denote it as 
E{I g r g \[n 0g ,ni g )). The denominator P[[n 0g , n lg )) can be 
calculated from Equation 1. Thus, 



% (Tu | Data) = n g E(I g r g \{n 0g , n lg )). 



(5) 



Similarly, we can derive the formulas for £( t )(T 10 | 
Data), £ (t) ( T 0 i| Data) and £( t )(r 0 o|Data), and the recur- 
sive formulas for a and fi can be derived from them. 
SNP identification using EM 

Due to sequencing errors, the observed variants may 
contain a significant amount of false positives, i.e. loci 
that are not truly polymorphic. Thus, before testing for 
associations with phenotypes, we need to determine the 
true polymorphic sites. This step is especially important 
in the case of rare variants since the sequencing error 
rates for NGS could be close to or even higher than the 
minor allele frequencies. 

Consider a case-control study with a group of case 
individuals and another group of control individuals. Let 
fx and / 0 be the minor allele frequencies at a locus 
among the cases and controls, respectively. Denote f = 
(/o,/i) and 0 = (0, 0). We can test if an observed variant 
is a true SNP using the likelihood ratio test for H 0 :f 0 = 
fx = Ovs.Hx :f 0 * 0 oif * 0: 



H 0 1 1 2 1 2 

A = 2(? f/0 - lf=o) ~ -h + -Xi + 4X2* 



(6) 



where If is the maximum log-likelihood of the 
observed data for both the cases and the controls. Note 
that the null hypothesis f = 0 is on the boundary of the 
region of the parameters of interest. Therefore, the 
asymptotic distribution of A is ij 0 + jXi + \xi when 
the number of pools is large according to [18], where I 0 
is the point mass at 0 and x, 2 , i = 1, 2 are the chi- 
square distributions with i degrees of freedom. When 
the number of pools is relatively small, simulation 
approaches for the null distribution of A are needed to 
obtain the asymptotic distribution. 

We can also test if an observed variant is a true SNP 
using cases or controls separately. For the control pools, 
we conduct a likelihood ratio test for H 0 : f 0 = 0 vs. : 
f 0 > 0. Similarly, we replace f 0 by fx for the case pools. 
We then use the statistic 



H 0 1 1 

Aj = 2(l fi¥0 - l fi=0 ) ~ -J 0 + -Xi 2 , 



i= 1,2, 



(7) 



to test each hypothesis, where lf t and lf 0 are the maxi- 
mum log-likelihood of the data for the cases and con- 
trols, respectively. Because the null hypothesis/- = 0 is 
on the boundary of parameter region ^ > 0, the statistic 

A, has an asymptotic distribution jl 0 + jXi when the 
number of pools is large according to [18]. We refer to 
the above method for SNP identification as EM-SNP. 
Testing for associations between a SNP and a phenotype 
in case-control studies 

We test if a SNP is associated with a phenotype of inter- 
est using the likelihood ratio test again. Here we test the 
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alternative hypothesis H 1 : fi * f 0 versus the null hypoth- 
esis H 0 :fi= f 0 . This association test is conducted by the 
likelihood ratio test statistic: 

A = 2 [/(unrestricted / 0 ,/i, a, fi) - /(restricted /, a, p)\ ~ Xi- ( 8 ) 

This statistic has an asymptotic chi-square distribution 
with 1 degree of freedom. 

Simulation studies 

We use simulations to evaluate our approaches for allele 
frequency estimation, SNP detection and test for asso- 
ciation. A large range of parameter space is considered 
to see how different parameters affect the performance 
of our methods. These parameters include minor allele 
frequency (/), sequencing error rate (a), the number of 
chromosomes in each pool (K), the number of pools (G) 
and the relative risk for a disease (A). 
Pooled data generation 

In our simulations, we set a = fi and choose four starting 
values ofa = P = 0.05%, 0.1%, 0.5%, 1% corresponding to 
different sequencing error rates ranging from low to high. 
The sequencing error rate for current NGS technologies is 
around 1% and we expect that it will decrease as the tech- 
nologies improve. Therefore, we also consider much lower 
sequencing error rate in our studies. For the allele fre- 
quency, we choose four values /= 0.1%, 0.5%, 1%, 5%. Loci 
with minor allele frequencies above 5% are considered to 
be common. We want to study if it is possible to estimate 
the minor allele frequency even if it is lower than the 
sequencing error rate. We let the read number n = 1000 
and 3000, which is around the sequencing depth in [8]. To 
study the effect of the number of chromosomes, we let K = 
50, 100, 200. The number of pools is set at G = 10, 20, 50. 

Since the sequencing error rate can vary from locus to 
locus and from one pool to another, we generate 1000 a, 
(= Pi, i = 1, • ■ •, 1000) values from a normal distribution 
with mean equal to the starting values of a, and variance 
equal to 0.1 times the starting values. Finally, we generate 
1000 pooled data sets with each combination of the five 
parameters (K, G, n,f, a) based on a;(= /},), i = 1, 1000. 
Measuring the accuracy of the allele frequency estimation 
For each of the 4x4x2x3x3 = 288 combinations, 
we do the following: 

1. In the 2-th simulation, we use the EM algorithm 
derived above to estimate the parameters (f, a, fi), 
denoted as {f^,<Xi,%}- We a ^ so consider a naive 
estimate for the minor allele frequency as the frac- 
tion of observed minor alleles in the observed reads, 

f© = S=l ni ' (9) 
2-^=1 n & 



2. Repeat Step 1) for R = 1000 times. 

3. Compute the mean squared error (MSE) of f em 

and / avg from the true population minor allele fre- 
quency f, 

1 R 

MSE(f em ) = -£(/S-/) 2 , 

i=i 

1 R 

MSE{f avg ) = -J2(fi^-f) 2 - 



4. Compute the MSE of f em and / avg from the frac- 
tion of chromosomes carrying minor alleles /f rac = 
CI{KG) in the pools, 

1 R 

Cg[fem) = -^^-/frac) 2 , 

!=1 
1 R 

^[favg) = -^^-/frac) 2 . 

1=1 

We use both MSE and Cg to compare the accuracy of 
the EM algorithm with the naive approach of estimating 
/by the fraction of reads carrying the minor allele. 
Generating case-control data to study the power of SNP 
identification and association studies using EM 
In order to evaluate the power of SNP identification 
using EM-SNP and test for association, we simulate 
case-control data as follows. When generating the con- 
trol data, we assume that the minor allele frequency is/ 
and that the locus is under Hardy- Weinberg equili- 
brium. When generating the genotypes of the case indi- 
viduals, we assume that the penetrance (the probability 
an individual is affected) of the genotypes 00, 01 and 11 
are go = 0.01, Ago. and X 2 g 0 , respectively. In our simula- 
tions, we choose A = 1.2, 2, and 4. 

We can use the case or control samples separately or 
combine them for SNP detection as in the "SNP identi- 
fication using EM" subsection. For example, we consider 
both the cases and controls jointly. The log-likelihood 
ratio statistic A (or A ; if we use case or control samples 
separately) defined in equation 6 is used to test if an 
observed variant is a true polymorphic locus or not. For 
a given type I error J (= 0.05 in our study), we claim 
that the variant is truly polymorphic if A >t T where t 7 is 
the threshold corresponding to type I error y. If the 
threshold can not be found theoretically, we can do 
parametric bootstrap to find the threshold. Firstly, 
assume that the variant site is not polymorphic, estimate 
the allele frequency and error rate using the maximum 
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likelihood approach. Secondly, generate the reads data 
as in our model a large number of times (R = 1000), 
and obtain the empirical distribution of A. For a given 
type I error y, we find the upper y percentile t r Finally, 
the null hypothesis is rejected if the value of A is at 
least t r For a given relative risk A, we repeat these steps 
1000 times and the power is the fraction of times that 
the locus is called as polymorphic. 

Similar approaches can be used to study the power of 
association studies using the pooling design. For details, 
see additional file 1. 

A pooled sequencing data set related to type 1 diabetes [8] 

We use our method to study the pooled sequencing data 
related to type 1 Diabetes dataset (T1D) in [8] and com- 
pare the results with current methods for SNP identific- 
tion [11]. The data was generated using DNA samples of 
480 T1D patients and 480 healthy controls, arranged in 
20 DNA pools, with 48 patients/controls in each pool. 
Roche 454 sequencing was used to sequence 144 target 
regions that cover exons and splice sites of 10 candidate 
genes. We use MOSAIK (http://bioinformatics.bc.edu/ 
marthlab/Mosaik) to map the reads to the human refer- 
ence genome (hgl9) with parameters -hs 15 -p 12 -mmp 
0.05 -act 26 - mhp 100 -bw 51 as recommended in its 
documentation. MOSAIK is a widely used reference- 
guided assembler that hashes the whole reference gen- 
ome and locate information in the hash table using a 
'jump database' [19-21]. Then we use SAMtools (http:// 
samtools.sourceforge.net/) [22] to pileup the reads onto 
the target regions. We also remove indels and keep loci 
that are covered by at least one read in each pool. Finally, 
we use ANNOVAR [23] to annotate the identified SNPs. 

Results 

We first present our results on the effects of various 
parameters on the estimation accuracy of the minor 
allele frequency using the EM algorithm. We then pre- 
sent the results on the power of SNP detection and 
association studies. Finally, we present our results on 
the analysis of the data in [8] using the approaches in 
the "Materials and Methods" section. 

The effects of minor allele frequency, sequencing error 
rate, number of individuals in the pools and number of 
pools on the accuracy of allele frequency estimation 

We compare our EM estimate f em with the naive estimate 

/ avg for minor allele frequency / Table 1 gives a brief sum- 
mary of the comparisons between these two methods. 
Each cell corresponds to the number of scenarios that the 
mean squared error (using either MSE or Cg) of f em 

exceeds / avg . It shows that f em consistently outperforms 



Table 1 Comparison of f em and / avg in terms of mean 
squared error 





a = 


0.05% 


a = 


0.1% 


a = 


0.5% 


a = 


1% 


MSE 
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MSE 


eg 


MSE 


eg 


MSE 


eg 


f = 0.1% 


0 


0 


0 


0 


0 


0 


0 


0 


f = 0.5% 


9 


5 


4 


3 


0 


0 


0 


0 


f = 1% 


13 


10 


9 


7 


0 


0 


0 


0 


f= 5% 


17 


16 


17 


16 


12 


12 


7 


7 



Number of scenarios where MSE em > MSE awg or Cg em > Cg avg out of 18 total 
scenarios for each cell. 



/ avg whenever /< 0.1% or if< 1%, a > 0.5%), which covers 

the typical situations of rare variant studies under current 
NGS technologies. Moreover, the advantage of the EM 
method increases as allele frequency /decreases and as 
sequencing error rate a increases, which is reasonable 
since it becomes more difficult for a naive estimate such 

as / av g to distinguish true minor alleles from sequencing 

errors as allele frequency decreases and sequencing error 
rate increases. On the other hand, the EM method shows 
greater superiority since it is specifically designed for the 
purpose. However, when the sequencing error rate is very 
low, for example, less than one out of 2000 and /> 1%, the 
simple naive estimation method works reasonably well. 

Figure 1 gives an example of a common pooled sequen- 
cing setting of a start = 1%, n = 3000, K = 100, G = 10, and 
a minor allele frequency of/= 1%. The upper left panel 

shows that / avg suffers from an evident over-estimation 
of both /and/f rac , while f em appears to be an unbiased 
estimate of / The upper right panel shows the histogram 
of f em over 1000 simulations. The lower left panel shows 
the box plot of / avg -/ frac and f em _/ frac , respectively. It 

shows that f em — f iac centers around 0, which suggests 
that the variance of /f rac might be responsible for the 
majority of the variance of f em . The bar plot of the MSE 

for both / avg and f em as estimates of / and /f rac in the 
lower right panel quantitatively demonstrates the super- 
iority of f em over / avg in terms of their mean squared 
errors. 

The relative errors of / avg and f em in estimating minor 
allele frequency f 

We measure the bias of an estimator by the relative error 
(RE) defined as RE = 100 x \f — f\/f, where f is the 
mean of the estimates off across all replications. The log 
values of the RE of all 288 simulations for both / avg and 

f em are given in Figure SI of the additional file 1. The 
figure shows that the number of reads n in each pool, the 
number of chromosomes K and the number of pools G 
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MSE^=mean((U-V-)*) 



'avg 'frac 



MSE a . 



MSE el 



Cg, 



f= 19 



Cg., 



Figure 1 Comparison of f em and / av g- An example for the comparison of performances between J em and jf av g, where 

n = 3000, K = 100, and G = 10. The two box plots on the left are a comparison between jf av g and J em as an estimate of f and ff mc , where 

/avg snows a considerable tendency of over-estimation compared to f em - The upper right histogram shows how J em deviates from fin 1000 
simulations, which appears to be roughly unbiasedly distributed around f. The lower right bar plot is a summary plot of the MSE for these two 
methods as estimates of f and f frao showing superiority of f gm both as an estimator of f and of f frac , in terms of MSE and Cg. 



have little effect on the RE of f avg , while the allele fre- 
quency / and the sequencing error rate a play a dominant 
role in affecting RE. To further explore their effects, we 
demonstrate the average effects of /and a on the RE by 



computing the average RE based on the values of/and a 
across all different («, G, K) in Table 2. It is interesting to 

observe that fixing a, the RE of ^ avg decreases linearly 
with/ ; while fixing/, the RE of / avg increases linearly 



Table 2 Comparison of / 


and /avg 


in terms of average relative error 










a = 0.05% 


a = 


0.1% 


a 


= 0.5% 


a = 


1% 


f 


RE? 

Javg 


RE f 

J em 


RE f 

7 avg 


RE> 

J em 


RE> 

Javg 


RE; 

J em 


RE; 

J avg 


RE; 

J em 


0.1% 


52.0 


9.4 


102.0 


15.6 


502.0 


72.0 


1 000.0 


146.0 


0.5% 


10.3 


4.0 


20.2 


4.9 


99.5 


13.3 


199.0 


26.5 


1% 


5.0 


3.3 


10.0 


3.7 


49.2 


5.7 


98.3 


9.5 


5% 


1.0 


5.4 


1.9 


6.0 


9.1 


6.7 


18.1 


6.3 



The average RE of f and f for different values of minor allele frequency f and sequencing error rate a. 
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with a. Thus, / avg suffers most severely in the case of rare 
polymorphisms and high sequencing error rate. It can be 
seen from Table 2 that the RE of f em in estimating minor 
allele frequency /is significantly lower compared to that 
of / avg for rare polymorphisms at/< 1%. 

Next we present our results for the effects of (K, G, n, 
f, a) on the estimation accuracy of minor allele fre- 
quency / using the EM algorithm. We note that the esti- 
mation of J3 = /(0|1) is highly unreliable (data not 
shown). This phenomenon can be explained as follows. 
When minor allele frequency /is low, the expected 
number of chromosomes having the minor allele in 
each pool is also low. When the number of pools G is 
small, the estimation of fi can be difficult with a small 
number of chromosomes carrying the minor allele. 
Thus, we do not show detailed results on the estimation 
of /?. Despite the fact that fi cannot be reliably esti- 
mated, the other two parameters / and a can be reliably 
estimated using the EM approach. 

The effects of minor allele frequency f and sequencing error 
rate a on the estimation accuracy of f 

To study the effects of minor allele frequency / and 
sequencing error rate a on the estimation accuracy of f em , 
as an estimator of both / and /f ra „ we fix (K, G, n) = (100, 
10, 3000). The histograms of f em under each combina- 
tions of / and a are shown in Figure S2 of the additional 
file 1. We observe that f em is roughly unbiasedly distribu- 
ted around / but the variance of f em as an estimator of/ is 
relatively large. The source of this variance, however, is 
largely due to the variance of/f rac , rather than the algo- 
rithm itself, as shown in Figure S3 of the additional file 1, 
where the histograms of f em — / frac is tightly distributed 
around 0, with the majority of the variance shown in 
Figure S2 of the additional file 1 disappeared. This is an 
explicit evidence that the variance of f em consists mostly 
of the variance of / rac . Thus, f em might serve better as an 
estimator of/f rac than of / We also observe as a general 
trend that f em appears to be a roughly unbiased estimator 
for both /and / rac , and its variance appears to be affected 
less by a but significantly by / This observation is also 
confirmed in Table 3 where MSE's and Cg's for different 
combinations of /and a are shown. 

To reduce the effect of a few outliers of f em on the 
MSE and Cg calculation, we also modified the defini- 
tions of MSE and Cg by removing the top and bottom 
k% of its values and recalculate the values of MSE and 
Cg. The results on the modified measures are presented 
in additional file 1 and the qualitative results on the per- 
formance of f em continue to hold. 



Table 3 / as an estimator of f or f 1rac 



f 


a = 


0.05% 


a = 


0.1% 


a = 


0.5% 


a = 


1% 


MSE 


eg 


MSE 


eg 


MSE 


eg 


MSE 


eg 


0.1% 


9.8e-7 


4.3e-8 


9.7e-7 


1 ,2e-7 


3.2e-6 


2.8e-6 


1.1 e-5 


1.1 e-5 


0.5% 


5.5e-6 


1 ,9e-7 


5.4e-6 


1 ,9e-7 


67e-6 


1 ,5e-6 


1.0e-5 


5.8e-6 


1% 


1 ,2e-5 


8.6e-7 


1 .2e-5 


9.6e-7 


1 ,3e-5 


2.3e-6 


1.6e-5 


5.6e-6 


5% 


5.5e-5 


1 ,3e-5 


5.9e-5 


1 7e-5 


8.3e-5 


4.3e-5 


1.0e-4 


7.0e-5 



The mean squared errors (MSE) and Cg's of j em as an estimator for f or f frac 
for various combinations of f and a, (K, G, n) = (100, 10, 3000). 



We also studied the effects of (/<", G, n) on the estima- 
tion accuracy of f em and the details of the simulation 
results are given in additional file 1. It was observed 
that the accuracy increases with G and n as expected. 
However, the accuracy decreases with the number of 
individuals K in each pool. 

Results on the power of SNP calling using the likelihood 
ratio test 

We next study the effects of (K, G, n,f, a) on the power 
of SNP detection using the likelihood ratio approach for 
the case and the control samples, respectively. The 
number of reads in each pool («) is set at either 1000 or 
3000 as in the above simulations. We start from default 
values of the parameters {K, G, n, / a) = (100, 20, 1.2, 
1%, 0.5%). Then we change one of these parameters and 
keep all the other parameters at default. Figure 2 shows 
the results for such a study and the results for using 
case and control samples together are given in the addi- 
tional file 1 in Figure S8. 

It can be seen from Figure 2 that at a type I error rate 
of 0.05, the power is consistently well above 0.9 in all 
scenarios demonstrated here except for the extremely 
rare variant case of / = 0.1%. The power tends to 
increase with the minor allele frequency or the number 
of pools, while it decreases with sequencing error rate 
or number of individuals in each pool. The power also 
increases with the number of reads in each pool. We 
observe that the power using the case samples is slightly 
higher than that using the control samples. This obser- 
vation can be explained by the fact that the frequency of 
the minor allele in the cases is higher than that in the 
controls, resulting in higher power of SNP detection. 

Results on the power of association studies using the 
likelihood ratio test 

We also study the effects of (K, G, n,f, a) on the power of 
detecting associations between SNPs and phenotypes 
using the likelihood ratio approach for the case and con- 
trol samples together. The parameter setting is similar to 
that in the above subsection except that here we also let 
the relative risk A to be 1.2, 2, and 4, respectively. Figure 3 
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shows how each parameter affects the power of detecting 
the association. 

It can be seen from Figure 3 that at a type I error rate 
of 0.05, the power increases with X and approaches 1 as 
X goes up to 4, which happens in all scenarios demon- 
strated here except for the extremely rare variant sce- 
nario of / = 0.1%. The power increases with allele 
frequency, pool size or number of pools, while it seems 
robust with respect to changes in sequencing error rate. 

Results on the analysis of the type 1 diabetes data in [8] 
Allele frequency estimation and SNP calling in the control 
samples 

We apply our approaches to analyze the pooled sequen- 
cing data in [8]. First, we conduct SNP calling using 
both our method EM-SNP and SNVer (parameter set- 
ting -bq 20 -a 0 -f 0 -p 1 -t 0) [11], a program that has 
been shown to outperform several other programs for 
SNP calling including CRISP, SAMtools, and GATK. 
Unlike many previous programs calling variants as SNPs 
or not, SNVer ranks variants according to their potential 



of being true SNPs using the p-value defined in the pro- 
gram. As a likelihood ratio test, EM-SNP can also rank 
the variants by the magnitude of the likelihood ratio. 
The estimated allele frequency spectrum of the top 100 
called SNPs by either EM-SNP or SNVer is given in Fig- 
ure S9 of the additional file 1. Note that 5 variants have 
dominant non-reference alleles and they are excluded 
from both lists for a fair comparison. In the SNVer list, 
we also exclude the variants that are removed in the 
preprocessing stage of EM-SNP. Both frequency spec- 
trums of the variants called by EM-SNP and SNVer 
tend to concentrate on the lower frequency range. 
Evaluation of the SNP calling results 

A standard approach to evaluate the effectiveness of a 
SNP calling method is to compare the fraction of 
dbSNPs [24] among the top ranked SNPs, defined as the 
dbSNP ratio. The rationale is that if a SNP calling 
method is reasonable, it should be able to detect the 
SNPs that are already in the dbSNP database because 
these SNPs have been reported in previous studies. 
Therefore, a higher dbSNP ratio indicates potentially 
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better performance of the SNP calling method. Figure 4 
shows the cumulative dbSNP ratio of the top 100 called 
variants whose minor allele frequencies are less than a 
threshold. To further demonstrate the effect of minor 
allele frequency on the performance of EM-SNP and 
SNVer, we also show the dbSNP ratio in different win- 
dows of minor allele frequencies in Figure S10 of the 
additional file 1. 

In terms of the dbSNP ratio for the top 100 called 
variants, EM-SNP consistently outperforms SNVer 
under all allele frequency thresholds, and EM-SNP dis- 
plays significant superiority especially for low frequency 
variants. In Table S7 of the additional file 1, we give an 
example of the dbSNP ratio among the top 100 SNP 
calls with f em < 0.2% for the two methods. Using a total 
of 480 control samples, EM-SNP identifies variants with 
minor allele frequency less than 0.2% with a high 
dbSNP ratio, which serves as an evidence of its superior 
performance in rare variant scenarios. On the other 
hand, the upper left panel of Figure S10 in the addi- 
tional file 1 shows that the relative performance of EM- 



SNP and SNVer based on dbSNP ratio depends on 
minor allele frequency. EM-SNP detects more rare var- 
iants and has higher dbSNP ratio at minor allele fre- 
quency less than 1%. Whereas this relative performance 
of EM-SNP and SNVer is reversed for minor allele fre- 
quency above 1%. Thus, EM-SNP is most useful in 
detecting rare variants. 

Another criterion to evaluate SNP calls is the transi- 
tion-transversion (Ti/Tv) ratio. It is well known that 
transitions are much more frequent than transversions 
in evolution, and the number of transitions over the 
number of transversions, referred to as Ti/Tv ratio, in 
known SNPs is expected to be between 2 and 4 [25]. 
Figure 4 shows that EM-SNP gives a consistently higher 
Ti/Tv ratio throughout the entire allele frequency range 
for both the whole set of called variants and the novel 
set. For the known variants, the Ti/Tv ratio trends of 
the two methods are similar to each other. Table S8 in 
the additional file 1 gives an example of the Ti/Tv ratio 
among the top 100 SNP calls with/ EM < 0.2% by EM- 
SNP and SNVer. The effect of minor allele frequency on 
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the relative performance of EM-SNP and SNVer in 
terms of Ti/Tv ratio is similar to that in terms of 
dbSNP ratio (Figure S10 in the additional file 1). 

We also consider the top 150 ranked SNPs and the 
corresponding figures and tables are shown as Figure 
S11-S12 and Tables S7-S8 in the additional file 1. The 
qualitative conclusions are the same. 

Identifying SNPs associated with type 1 diabetes 

We then study the association of the identified variants 
with type 1 diabetes (T1D). We first look at the com- 
mon SNPs with estimated minor allele frequencies 
above 1% in the controls as in [8] and want to see if we 
can identify the common SNPs associated with T1D. 
With the estimated allele frequencies, we can estimate 
the numbers of minor alleles in the controls and in the 
cases separately. Based on the estimated counts, we 
obtain a preliminary p-value based on the Fisher's exact 
test as in [8]. However, the p-value obtained this way is 
not accurate as it does not consider the variation in 



estimating the allele frequencies using the EM algo- 
rithm. Therefore, we then use the likelihood ratio statis- 
tic defined in equation 8 to calculate another p-value 
that is given in the last column in Table 4, where we 
only list SNPs with preliminary p-value less than 10" . 
The p-value obtained through the likelihood ratio test 



Table 4 Association results 



SNP 


Gene 


fo 


"o 


fl 


"i 


Fisher's p- 


EM p- 










value 


value 


rs3 184504 


SH2B3 


0.52 


499 


0.41 


394 


1 .9e-6 


8.4e-7 


rs7076103 


IL2RA 


0.19 


178 


0.10 


93 


4.5e-8 


2.7e-7 


rs2476601 


PTPN22 


0.09 


86 


0.16 


151 


8.1 e-6 


9.2e-6 



Testing for association between common SNPs f-f^ > \%} and T1D with 



Fisher's preliminary p-value at most e-5. In the table, and are the 
estimated minor allele frequencies using the EM algorithm in the controls and 

cases, respectively, and H, |^960 X /jj , / = 0, 1. The Fisher's p-value is the 

preliminary p-value using the Fisher's exact test as in [8] and the EM p-value 
is the p-value calculated based on the likelihood ratio statistic in equation 8. 
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reflects the true p-value better because it takes the var- 
iation in estimating the allele frequency into account. 

The SNP rs3 184504 residing within gene SH2B3 has 
an EM p-value of 8.4e - 7. This SNP was also found to 
be associated with a preliminary p-value of 5e - 7 in the 
original study [8] and the corresponding gene was pre- 
viously identified to be associated with T1D. The frac- 
tions of observed minor alleles in controls and cases in 
the original study were 53% and 41.5%, respectively, and 
our mapping results are consistent with the estimates. 
The EM estimated minor allele frequencies in the con- 
trols and cases are 52% and 41%, which are slightly 
smaller than the observed values since we took sequen- 
cing errors into account. The SNP rs7076103 within the 
gene IL2RA has a preliminary Fisher's p-value of 4.5e-8 
and an EM p-value of 2.7e-07. This SNP was not 
reported to be associated with any phenotypes according 
to the catalog of GWAS studies (http://www.genome. 
gov/gwastudies/) to date. Nevertheless, other SNPs 
within the IL2RA gene were found to be associated with 
T1D by Cooper et al. [26] before the publication of [8] 
and by two other studies [27,28] after the publication of 
[8] using different approaches. Barrett et al. [27] used 
genome-wide association studies giving a p-value of 
1.0e-13 while Huang et al. [28] used imputation of the 
genotypes based on the 1000 genome projects yielding a 
p-value of 5e-9. These new studies support our signifi- 
cant result on the association of rs7076103 with T1D. 
The estimated minor allele frequencies in the controls 
and cases were 18.8% and 14.8% in [8] giving a p-value 
of 0.02 which is not significant after adjusting for multi- 
ple testing. This example shows that even for common 
polymorphisms, the EM approach can help to find likely 
associations that the naive approach can not. The SNP 
rs2476601 was found to be associated with T1D in sev- 
eral studies before the publication of [8] and was con- 
firmed in a recent study in [29] that was published after 
the publication of [8]. All these studies support our 
results for the association of common polymorphisms 
with T1D. 

For rare polymorphisms (f 0 < 1%) within the con- 
trols, we first use the naive approach described above to 
obtain a preliminary Fisher's p-value for every SNP. Due 
to the low minor allele frequencies of the rare variants, 
none of the p-values is smaller than 0.001. We did not 
pursue the association of individual variants with T1D 
further. 

Discussion 

In this paper, we developed an EM algorithm based uni- 
fied approach for minor allele frequency estimation, 
SNP calling and association studies, applicable to pooled 
sequencing data where genetic materials of multiple 



individuals are pooled together. This study differs from 
previous studies in that we estimate sequencing error 
rate for each position while previous studies generally 
assume a pre-specified sequencing error rate across all 
sequenced regions. Since sequencing error rate depends 
on the genomic context, it is essential that the sequen- 
cing error rate be estimated specifically for different 
loci. In a pooling design without tagging, the origin of 
the reads is not known, and it is impossible to obtain 
the individual genotypes from the pooled data. There- 
fore, we modelled the pooled sequencing data as a 
"missing value" problem and designed an EM algorithm 
to estimate the minor allele frequency and sequencing 
error rate. 

We first studied the effects of minor allele frequency, 
sequencing error rate, number of pools, number of indi- 
viduals in each pool, and the sequencing depth in each 
pool, on the estimation accuracy of the minor allele fre- 
quency. It was shown that the naive approach, which 
estimates the minor allele frequency by the fraction of 
observed minor alleles in the reads, can significantly 
over-estimate the true minor allele frequency, and that 
the effect is most severe for rare variants. The EM based 
algorithm, on the other hand, can estimate the minor 
allele frequency in a relatively unbiased manner. 
Although the variation of this estimation seems to be 
relatively large, a major part of the variation comes from 
the sampling of individuals from the population rather 
than the algorithm itself. We also show that the estima- 
tion accuracy of the EM algorithm increases with the 
number of pools and sequence depth as expected. How- 
ever, the estimation accuracy decreases with the number 
of individuals in each pool, most likely because a more 
extensive pooling induces greater loss of information. 
Secondly, we used a likelihood ratio statistic based on 
the estimated parameters from EM to call SNPs. With 
the real data from [8], in terms of the dbSNP ratio, we 
showed that EM-SNP outperforms SNVer for rare var- 
iants with minor allele frequency less than 1%. We also 
showed that the transition/transversion ratio of the 
called SNPs for rare variants based on EM-SNP is 
higher than that of the called SNPs by SNVer. These 
two independent pieces of evidence demonstrate that 
EM-SNP is superior to SNVer in the discovery of rare 
variants. However, the extent of this advantage decreases 
as minor allele frequency increases due to the tradeoff 
between EM-SNP's bias adjustment for the estimation 
of minor allele frequencies and extra variation intro- 
duced in the EM algorithm. Finally, we applied our 
approach to reanalyze the case-control data from [8] 
and showed that we can find the associated common 
SNPs. Unfortunately we did not find any significantly 
associated rare variants. One possible explanation is that 
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the power of finding rare variants associated with com- 
plex traits is generally low as a consequence of the low 
frequencies of minor alleles. 

We made several simplifying assumptions in our 
study. First and foremost, we did not consider errors 
introduced by mapping the reads to the reference gen- 
ome. The mapping of Roche 454 data still has many 
challenges, in particular, in regions around homopoly- 
mers, and further development of algorithms for map- 
ping is needed. Secondly, although we assumed that the 
amount of genetic materials from each individual is the 
same for each pool, this assumption can be violated. To 
overcome this problem, one approach would assume 
that the fractions of genetic materials from individuals 
follow a Dirichlet distribution [17]. Thirdly, the called 
SNPs by EMSNP still have many false positives since 
the Ti/Tv ratio for the called novel SNPs is still low 
compared to the known SNPs. Further improvements in 
SNP calling are needed. Finally, the computational 
speed of the EM based approach can be relatively slow, 
and the method cannot be applied to whole genome 
association studies although this is not a problem for 
targeted sequencing studies as in [8]. These are the 
topics for future research. 

Software 

Software can be downloaded from http://www-rcf.usc. 
edu/~fsun/Programs/EM-SNP/EM-SNP.html. 

Additional material 



Additional file 1: Supplementary materials. Supplementary methods 
and results. 
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